# Random elements of this notebook will evaluate identically
# across users if you run this line of code first.
set.seed(87295)

Introduction to R
R fundamentals
This section covers functions, variables, installing packages, loading data, vectors, dataframes, wrangling, and visualization. We work with tidyverse. I recommend this lesson to most anyone using R for data analysis.
Hello World
This file is called a quarto notebook.
It contains a combination of markdown style text, and code chunks.
Markdown is a simple style that allows you to create headings of decreasing size with #, ##, ###, …, bulleted lists with -, bold with **double asterisks**, and italics with *single asterisks*.
There are other more advanced capabilities you can read about online.
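For instance, a few lines of markdown (an illustrative sketch) might look like:

```markdown
# A big heading

## A smaller heading

- a bulleted item
- another item with **bold** and *italics*
```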
We can also make code cells.

print is called a function. Many of the tools we will use are functions built by someone else. "Hello World!" is called a string. A string is a kind of data made up of text. We enclose strings in quotes.
# This is a comment that R ignores when parsing code
print("Hello World!")[1] "Hello World!"
- Every function is called by typing out: function.name( argument ); some people might say we execute or run a function. In our first program above, function.name is the word print and argument is the string "Hello World!".
- You can run a code cell by clicking Run Cell or pressing Cmd + Shift + Enter. You can run the line where your cursor is by pressing Cmd + Enter.
- You can render a quarto notebook by clicking the quarto symbol in the upper right. More on that at the end.
Every programmer begins their journey by greeting the new world they are about to join. Create a code chunk called “print function”. Call the print function on a string of your choice.
{Your code chunk goes here}
Calculator
- R is a fancy calculator.
(2 + 3) * 5
[1] 25

2^3
[1] 8

2^3 / 3
[1] 2.666667
R is a fancy calculator. Make a code chunk below and title it calculator.
Write one or more expressions that use each of addition, subtraction, multiplication, division, exponentiation, and parentheses.
{Your code chunk here}
Write an expression that R computes correctly, but which a simple calculator (such as might be on a phone) would get wrong if entered the same way.
{Your code chunk here}
I’m planning an event for 329 people who all need transportation from Denver to Fort Collins. We have charter buses that hold 64 people each and vans that hold 10 people each. The buses are very expensive, so I only want to use one if it will be full. The vans are cheaper, but still expensive, so I only want to use full vans too. Using the fewest total vehicles possible, how many full buses and full vans do I need to carry everyone? How many people are left over who I might want to convince to carpool? Hint below.
{Your code chunk here}
Hint: Use the modulo operator (%%) and integer division (%/%).
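If you haven’t met these operators yet, here is a minimal sketch of integer division and modulo on an unrelated number, so it doesn’t give away the answer:

```r
# %/% is integer division: how many whole groups of 5 fit into 17?
17 %/% 5
# [1] 3

# %% is modulo: what remains after taking out those whole groups?
17 %% 5
# [1] 2
```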
Variables
- Programming languages are languages.
- Variables have expressive names and values that they represent.
- To create a variable, we use the assignment operator, <-
# Estimate the number of eyes in the zoom room
num.per.animal <- 2
num.people <-
num.pets <-
total.eyes <- (num.people + num.pets) * num.per.animal

Error in eval(expr, envir, enclos): object 'num.people' not found
num.pets <- 3
num.pets <- num.pets + 4
# This line produces an error because you can't assign to a number
# 3 <- 19

- Variables can be reassigned to new values.
- Periods, numbers, underscores, and letters are all okay in variable names, but a name can’t start with a number.
- Click the R extension icon in the left sidebar of Visual Studio Code. It will open a pane that contains a section called “Global Environment”. Here you will see all the variables you have created and some basic information about them, such as their type. Click the icon with the magnifying glass to view them in detail. Click the “x” to delete them.
In the calculator above, the numbers are meaningless. They could represent anything. Now we’re going to give them some meaning.
Create a variable called two and assign the value -1. Create a variable called number and assign the value 6. Create a variable called number1 and assign the value 2. Then, compute two times number raised to number1 power.
{Your code chunk here}
Reflect on the choice of variables I suggested, and consider the importance of choosing descriptive names. Repeat the problem using new variables with better names.
Consider the variable assignments below. If you modify the value of x, does the value of z change? If you modify the value of z, does the value of x change?
x <- "x"
z <- x

{Your answer here}
Installing packages + loading data part 1
- Packages are collections of functions and data that someone else programmed for you. Don’t reinvent the wheel!
# To install a new package for the first time:
# install.packages("tidyverse")
# After running this once, you never have to run it again until you update R or the package.
# Tell R to use a package in the current session.
# You have to call this once every time you start the R interpreter.
library("tidyverse")── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
- tidyverse helps us with data cleaning, wrangling, and visualization.
- It is a wrapper around many smaller packages like dplyr, stringr, tidyr, ggplot2, and others.
- Conflicts occur when you load in functions or data from two different packages that have the same name.
- If you wish to specify which version of filter to use, you must use the package name prefix dplyr:: or stats:: before filter. If you just use filter, R will call dplyr::filter because it was loaded most recently (it masks stats::filter).
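For instance, here is the package-prefix syntax on two functions that ship with R itself (the stats and base packages), so no masking can interfere:

```r
# package::function calls a specific package's version, regardless of masking
stats::sd(c(2, 4, 6))    # standard deviation from the stats package
# [1] 2

base::mean(c(2, 4, 6))   # mean from base R
# [1] 4
```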
Install a package called fivethirtyeight and load it into your namespace. Find its documentation online and read about some of the datasets it provides. The datasets are in your current namespace, so you can refer to any of them by name. Start by saving the one about births since 2000 to a variable called births.
{your answer here}
# Run this line once for each R version
# install.packages("fivethirtyeight")
library(fivethirtyeight)

Some larger datasets need to be installed separately, like senators and
house_district_forecast. To install these, we recommend you install the
fivethirtyeightdata package by running:
install.packages('fivethirtyeightdata', repos =
'https://fivethirtyeightdata.github.io/drat/', type = 'source')

births <- US_births_2000_2014

Inspecting dataframes + loading data part 2
- Let’s see some ways of viewing the data nicely. All of the following are functions available either through base R or tidyverse.
# View the whole dataframe
# births
str(births)

tibble [5,479 × 6] (S3: tbl_df/tbl/data.frame)
$ year : int [1:5479] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
$ month : int [1:5479] 1 1 1 1 1 1 1 1 1 1 ...
$ date_of_month: int [1:5479] 1 2 3 4 5 6 7 8 9 10 ...
$ date : Date[1:5479], format: "2000-01-01" "2000-01-02" ...
$ day_of_week : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 7 1 2 3 4 5 6 7 1 2 ...
$ births : int [1:5479] 9083 8006 11363 13032 12558 12466 12516 8934 7949 11668 ...
glimpse(births)

Rows: 5,479
Columns: 6
$ year <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ date_of_month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ date <date> 2000-01-01, 2000-01-02, 2000-01-03, 2000-01-04, 2000-01…
$ day_of_week <ord> Sat, Sun, Mon, Tues, Wed, Thurs, Fri, Sat, Sun, Mon, Tue…
$ births <int> 9083, 8006, 11363, 13032, 12558, 12466, 12516, 8934, 794…
head(births)

# A tibble: 6 × 6
year month date_of_month date day_of_week births
<int> <int> <int> <date> <ord> <int>
1 2000 1 1 2000-01-01 Sat 9083
2 2000 1 2 2000-01-02 Sun 8006
3 2000 1 3 2000-01-03 Mon 11363
4 2000 1 4 2000-01-04 Tues 13032
5 2000 1 5 2000-01-05 Wed 12558
6 2000 1 6 2000-01-06 Thurs 12466
tail(births)

# A tibble: 6 × 6
year month date_of_month date day_of_week births
<int> <int> <int> <date> <ord> <int>
1 2014 12 26 2014-12-26 Fri 10386
2 2014 12 27 2014-12-27 Sat 8656
3 2014 12 28 2014-12-28 Sun 7724
4 2014 12 29 2014-12-29 Mon 12811
5 2014 12 30 2014-12-30 Tues 13634
6 2014 12 31 2014-12-31 Wed 11990
nrow(births)
[1] 5479

dim(births)
[1] 5479 6

names(births)
[1] "year"          "month"         "date_of_month" "date"
[5] "day_of_week"   "births"

summary(births)
      year          month         date_of_month       date
Min. :2000 Min. : 1.000 Min. : 1.00 Min. :2000-01-01
1st Qu.:2003 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2003-10-01
Median :2007 Median : 7.000 Median :16.00 Median :2007-07-02
Mean :2007 Mean : 6.523 Mean :15.73 Mean :2007-07-02
3rd Qu.:2011 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:2011-04-01
Max. :2014 Max. :12.000 Max. :31.00 Max. :2014-12-31
day_of_week births
Sun :783 Min. : 5728
Mon :783 1st Qu.: 8740
Tues :783 Median :12343
Wed :783 Mean :11350
Thurs:782 3rd Qu.:13082
Fri :782 Max. :16081
Sat :783
# This line will produce an error when rendered because it tries to open a window.
View(births)

Error in check_for_XQuartz(file.path(R.home("modules"), "R_de.so")): X11 library is missing: install XQuartz from www.xquartz.org
- The births data contains integer, date, and ordered factor datatypes.
- Dates are a special data type that understands how dates work.
date()
[1] "Mon Jan 6 08:42:25 2025"
arbitrary.day <- ymd("2023-12-22")
following.day <- arbitrary.day + 1
arbitrary.day + 365
[1] "2024-12-21"
- class is a function we can use to check the type of an R object.

class(births$day_of_week)
[1] "ordered" "factor"
- Factors are a special datatype for categorical data.
factor(c("small", "small", "medium"), levels=c("small", "medium", "large"))[1] small small medium
Levels: small medium large
births$day_of_week %>% head()[1] Sat Sun Mon Tues Wed Thurs
Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
as.numeric(births$day_of_week) %>% head()[1] 7 1 2 3 4 5
Factors can have orderings, like the days of the week or sizes. Strings don’t have this feature.
Factors also store all the possible levels, even if that level doesn’t appear in the data.
These features are essential for using regression in R. The “first” level in the factor becomes the reference level, which is absorbed into the intercept term.
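A quick sketch of how the reference level works, using a made-up sizes factor; relevel is the base R function for changing which level comes first:

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))
levels(sizes)   # "small" is first, so it would be the reference level
# [1] "small"  "medium" "large"

# Move "medium" to the front so it becomes the reference level instead
sizes <- relevel(sizes, ref = "medium")
levels(sizes)
# [1] "medium" "small"  "large"
```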
You’re going to get a chance to inspect your own data in just a minute.
Let’s look at one more important way that many of you will load information into R, via files.
The R interpreter has a working directory (directory is another word for folder).
# Show the working directory
getwd()
[1] "/Users/ellery/My Drive/Documents-GC/cu/crdds/datacamp/datacamp2501"

list.files()
 [1] "data"          "exercises"     "exercises.qmd" "export"
[5] "html" "notes_cache" "notes_files" "notes.html"
[9] "notes.qmd" "notes.rmarkdown" "resources" "setup.R"
[13] "solutions"
Files you reference have paths relative to the working directory. A path consists of a sequence of folders separated by slashes (/), ending in a file name (e.g. “myfile.csv”). To get my notes on this subject, called “working.directory.txt”, located within “text” within “data”, I would use:
# Read lines makes each line in the text file into a string and returns
# a vector of strings.
wd.notes <- readLines("data/text/working.directory.txt")

- In general, to load a file into R, put the file into your working directory, or a subdirectory, and provide R with the path to the file.
Above, we got data from the fivethirtyeight package. Now we’re going to get data by loading a csv file.
Similar to how I used “readLines”, use the “read_csv” function to make a dataframe out of the csv file called “iris.csv” inside the “data” subdirectory. Save the data to a variable called irises. If you run into problems, remember to check your working directory.
Inspect the irises dataset using any of the approaches we’ve introduced so far. Answer the following questions:
- How many irises were measured?
- What quantities were measured?
- What are the first, second, and third quartiles for petal length?
- We’ve seen how dataframes contain our data as whole, but how do we break them down? What are their parts? How do we make them ourselves?
Vectors
- Here’s how to look at just one column in a dataframe.
dataframe.name$column.name.
births$year %>% head()
[1] 2000 2000 2000 2000 2000 2000

births$births %>% head()
[1] 9083 8006 11363 13032 12558 12466
This object is called an atomic vector, and it consists of an ordered sequence of R objects of the same type.
Be careful not to call it a list, as there is a different data type called a list with slightly different properties and uses.
The square-bracketed numbers at the beginning of each line are indices. They indicate that the first vector element on that line is, say, the 5461st element of the vector.
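The same square brackets let you pull elements out of a vector yourself; a small sketch with a made-up vector:

```r
heights <- c(150, 160, 170, 180, 190)

heights[2]        # the second element
# [1] 160

heights[2:4]      # a slice of elements two through four
# [1] 160 170 180

length(heights)   # how many elements the vector holds
# [1] 5
```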
Why are vectors useful? Computers are most helpful when the same operation must be done many times.
Let’s make our own vectors now.
red.apples <- c(2,3,5)
green.apples <- c(5, 8, 2)

- c is another function; c stands for combine.
- When functions take multiple arguments, we separate them with commas. Spaces don’t matter, but are good for style.
- You will use the c function whenever you need to provide multiple values to a function as a single argument, e.g. dimensions = c(width, height).
- Vector arithmetic is elementwise and built in.
- Most functions in R are vectorized.
red.apples <- c(2,3,5)
green.apples <- c(5, 8, 2)
total.apples <- red.apples + green.apples
all.apples <- c(red.apples, green.apples)
them.apples <- red.apples + red.apples
them.apples <- them.apples * green.apples

- The c function can also combine two or more vectors.
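To see what “vectorized” means in practice, here is a small sketch: one call transforms every element at once, no loop required.

```r
prices <- c(4, 9, 16)

sqrt(prices)          # square root of every element
# [1] 2 3 4

prices * 2            # doubles every element
# [1]  8 18 32

round(c(2.4, 2.6))    # rounds every element
# [1] 2 3
```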
Dataframes
- Dataframes are the main way we interact with data in R.
- They consist of a set of columns with names, each of which is itself a vector.
- The sequence of “first elements” in the dataframe’s columns is called the first “row”.
- Dataframes are like tables or matrices in that they have rows and columns, but it turns out there are other special datatypes in R for tables and matrices with slightly different properties.
- When you want to experiment with manipulating data, it is often helpful to create a fake dataframe that you know well. The data.frame function allows us to do this.
tomato.data <- data.frame(
variety = c("cherry", "roma"),
size = c("small", "medium"),
avg.weight.g = c( 20, 80 )
)
tomato.data

  variety   size avg.weight.g
1  cherry  small           20
2    roma medium           80

tomato.data$variety
[1] "cherry" "roma"

tomato.data$avg.weight.g - 1
[1] 19 79

nrow(tomato.data)
[1] 2

ncol(tomato.data)
[1] 3
Define a vector called mission and give it the values Apollo 11, Voyager 1, and Curiosity. Create another vector called launch.year and assign 1969, 1977, and 2011. Make one more vector called destination with values Moon, Interstellar Space, and Mars.
Now make a dataframe called nasa.missions that contains all this information and print it.
- Let’s do some data wrangling with tidyverse.
Selecting
- How do we select columns in the dataframe?
select(births, year, month, date_of_month)

# A tibble: 5,479 × 3
year month date_of_month
<int> <int> <int>
1 2000 1 1
2 2000 1 2
3 2000 1 3
4 2000 1 4
5 2000 1 5
6 2000 1 6
7 2000 1 7
8 2000 1 8
9 2000 1 9
10 2000 1 10
# ℹ 5,469 more rows
select(births, month, year, date_of_month)

# A tibble: 5,479 × 3
month year date_of_month
<int> <int> <int>
1 1 2000 1
2 1 2000 2
3 1 2000 3
4 1 2000 4
5 1 2000 5
6 1 2000 6
7 1 2000 7
8 1 2000 8
9 1 2000 9
10 1 2000 10
# ℹ 5,469 more rows
select(births, -year)

# A tibble: 5,479 × 5
month date_of_month date day_of_week births
<int> <int> <date> <ord> <int>
1 1 1 2000-01-01 Sat 9083
2 1 2 2000-01-02 Sun 8006
3 1 3 2000-01-03 Mon 11363
4 1 4 2000-01-04 Tues 13032
5 1 5 2000-01-05 Wed 12558
6 1 6 2000-01-06 Thurs 12466
7 1 7 2000-01-07 Fri 12516
8 1 8 2000-01-08 Sat 8934
9 1 9 2000-01-09 Sun 7949
10 1 10 2000-01-10 Mon 11668
# ℹ 5,469 more rows
births <- select(
births,
date,
year,
month,
day = date_of_month,
week_day = day_of_week,
births
)
head(births)

# A tibble: 6 × 6
date year month day week_day births
<date> <int> <int> <int> <ord> <int>
1 2000-01-01 2000 1 1 Sat 9083
2 2000-01-02 2000 1 2 Sun 8006
3 2000-01-03 2000 1 3 Mon 11363
4 2000-01-04 2000 1 4 Tues 13032
5 2000-01-05 2000 1 5 Wed 12558
6 2000-01-06 2000 1 6 Thurs 12466
births <- rename( births, birth.count = births )
head(births)

# A tibble: 6 × 6
date year month day week_day birth.count
<date> <int> <int> <int> <ord> <int>
1 2000-01-01 2000 1 1 Sat 9083
2 2000-01-02 2000 1 2 Sun 8006
3 2000-01-03 2000 1 3 Mon 11363
4 2000-01-04 2000 1 4 Tues 13032
5 2000-01-05 2000 1 5 Wed 12558
6 2000-01-06 2000 1 6 Thurs 12466
- Retrieve a subset of the columns by mentioning their names. Note we don’t have to use dollar signs!
- The order in which column names are mentioned is reflected in the output.
- Remove a column with the minus sign
- Optionally rename columns when you mention them.
- If you just want to rename a column without having to mention everything else, use rename.
Functions and reading documentation
We’ve seen lots of functions in R, but what are they really? How do we find them? And how do we know how to use them?
Suppose we want to know how to combine two strings. We can ask a chatbot like ChatGPT for the appropriate function.
class.first.names <- c("Alfred", "Grace", "T")
class.last.names <- c("Packer", "Hopper", "Swift")
# This line makes an error because the comma is missing
# class.names <- paste(class.first.names class.last.names)
class.names <- paste(class.first.names, class.last.names)
class.names
[1] "Alfred Packer" "Grace Hopper"  "T Swift"
- Let’s do something trickier. Suppose we have a vector of data.
my.data <- c(1, 2, 3, 4, 5)

- I’d like to take a random sample from my.data. Let’s ask a chatbot again. We can copy the provided example.
random.sample <- sample(my.data, size = 3) # Extracts a random sample of 3 elements
random.sample
[1] 5 2 1
- Maybe this isn’t exactly what we wanted. If we were using bootstrap resampling to simulate a larger dataset from a smaller one, we might want to sample with replacement.
- In addition to asking a chatbot, we can use the documentation.
help("sample")Here are the main parts of every documentation page
- Description is plain text for function
- Usage shows the call signature and default argument values
- R knows about TRUE and FALSE. More later.
- Arguments describes the meaning of each argument, and what type of data has to be supplied
- Details contains various notes about usage
- Value is the output of the function
- References shows who to cite if you use this function
- See Also is related functions that might be helpful
- Examples. Can run these with button.
bootstrap.resample <- sample(my.data, size = 20, replace = TRUE)
bootstrap.resample
 [1] 2 1 5 2 4 3 2 5 2 4 1 3 4 3 1 4 1 5 1 5
- Note you can also use the sample function to produce a random permutation of a vector. This is useful for permutation tests.
permuted <- sample(my.data)
permuted
[1] 1 5 3 4 2
- R has lots of built in operations you will learn over time.
Check out the documentation for the seq function. Answer the following questions using the documentation. If you need to, create a code chunk to try things out. Try not to use a chatbot. See hints below.
- In your own words, what does the seq function do?
- What is the default value for the to argument?
- What does the by argument do?
- What does the length.out argument do?
Hint: Read the details section carefully, including the description of typical usages. Remember also that there are examples at the bottom.
Write code that finds the sum of the first 9 odd numbers. You may find the sum function helpful.
Using only calls to the seq function and arithmetic operators, generate the following sequence: 5, 8, 9, 8, 5.
Filter and Sort
- We can also filter with conditional statements
# Using data in births, retrieve only the rows in years greater or equal to 2010
filter(births, year >= 2010) %>% head()

# A tibble: 6 × 6
date year month day week_day birth.count
<date> <int> <int> <int> <ord> <int>
1 2010-01-01 2010 1 1 Fri 7871
2 2010-01-02 2010 1 2 Sat 6990
3 2010-01-03 2010 1 3 Sun 6988
4 2010-01-04 2010 1 4 Mon 11844
5 2010-01-05 2010 1 5 Tues 12995
6 2010-01-06 2010 1 6 Wed 12520
# Same, but also only Saturdays
filter(births, year >= 2010, week_day = "Sat")

Error in `filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `week_day == "Sat"`?
# There is an error above due to the single = sign. Use == for comparison.
# Try again with ==
filter(births, year >= 2010, week_day == "Sat") %>% head()

# A tibble: 6 × 6
date year month day week_day birth.count
<date> <int> <int> <int> <ord> <int>
1 2010-01-02 2010 1 2 Sat 6990
2 2010-01-09 2010 1 9 Sat 7809
3 2010-01-16 2010 1 16 Sat 8032
4 2010-01-23 2010 1 23 Sat 7996
5 2010-01-30 2010 1 30 Sat 8035
6 2010-02-06 2010 2 6 Sat 7948
# Sort ascending by births
arrange(births, births) %>% head()

# A tibble: 6 × 6
date year month day week_day birth.count
<date> <int> <int> <int> <ord> <int>
1 2000-01-01 2000 1 1 Sat 9083
2 2000-01-02 2000 1 2 Sun 8006
3 2000-01-03 2000 1 3 Mon 11363
4 2000-01-04 2000 1 4 Tues 13032
5 2000-01-05 2000 1 5 Wed 12558
6 2000-01-06 2000 1 6 Thurs 12466
# Sort descending by births
arrange(births, desc(births)) %>% head()

# A tibble: 6 × 6
date year month day week_day birth.count
<date> <int> <int> <int> <ord> <int>
1 2014-12-31 2014 12 31 Wed 11990
2 2014-12-30 2014 12 30 Tues 13634
3 2014-12-29 2014 12 29 Mon 12811
4 2014-12-28 2014 12 28 Sun 7724
5 2014-12-27 2014 12 27 Sat 8656
6 2014-12-26 2014 12 26 Fri 10386
# Combine operations so far, plus tail, which shows just the last 6 elements
tail(arrange(select(filter(births, year >= 2010, week_day == "Sat"), -date), desc(births)))

Error in `arrange()`:
ℹ In argument: `..1 = births`.
Caused by error:
! `..1` must be size 261 or 1, not 5479.
- Getting kinda hard to read? Introducing the pipe.
births %>%
select(-date) %>%
filter(year >= 2010, week_day == "Sat") %>%
arrange(desc(birth.count)) %>%
tail()

# A tibble: 6 × 5
year month day week_day birth.count
<int> <int> <int> <ord> <int>
1 2011 11 26 Sat 7512
2 2012 4 7 Sat 7384
3 2011 1 1 Sat 7254
4 2010 1 2 Sat 6990
5 2011 12 24 Sat 6688
6 2010 12 25 Sat 6159
- The above syntax is given by the magrittr package, contained in tidyverse. R also has a native pipe operator |> with similar functionality but slightly different syntax and limitations. Just so you’re aware.
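A small sketch of the native pipe, which works in base R (version 4.1 or later) without loading any package; the magrittr equivalent is shown as a comment:

```r
# Native pipe: pass the left-hand value as the first argument on the right
c(4, 9, 16) |> sqrt() |> sum()
# [1] 9

# The magrittr version reads the same way (requires tidyverse/magrittr loaded):
# c(4, 9, 16) %>% sqrt() %>% sum()
```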
Using the births data set, select the date column and the week_day column, filter to only rows from January 1st, arrange by births descending. Don’t forget to use pipes (%>%).
Among weekends in the year 2010, which day (Sat or Sun) typically has fewer births? The exception is the weekend day with the overall smallest number of births. Why is that day different?
Boolean values
- When we used filter, we had an expression like year >= 2010. What was happening there, and how did R understand it?
- R understands the difference between right and wrong, true and false. These values are called booleans.
FALSE
[1] FALSE

TRUE
[1] TRUE

TRUE == false
Error in eval(expr, envir, enclos): object 'false' not found

TRUE == FALSE
[1] FALSE

TRUE != FALSE
[1] TRUE

FALSE == FALSE
[1] TRUE

!FALSE
[1] TRUE

-1 > 0
[1] FALSE

2 >= 2
[1] TRUE

3 >= 2
[1] TRUE

1.01 <= 1.02
[1] TRUE

"a" < "b"
[1] TRUE

"a" < "A"
[1] TRUE

"A" < "b"
[1] TRUE

"llama" >= "ladel"
[1] TRUE
- Like arithmetic, comparison operators are vectorized
- The syntax 1:5 is a shortcut for the vector of integers from 1 to 5 like c(1, 2, 3, 4, 5)
1:5
[1] 1 2 3 4 5

1:5 < 3
[1]  TRUE  TRUE FALSE FALSE FALSE

seq(1,10,2)
[1] 1 3 5 7 9

1:5 < seq(1,10,2)
[1] FALSE  TRUE  TRUE  TRUE  TRUE

c(TRUE, FALSE) | c(FALSE, FALSE)
[1]  TRUE FALSE

c(TRUE, FALSE) & c(FALSE, FALSE)
[1] FALSE FALSE
When we wrote year >= 2010 we had a case similar in structure to 1:5 < 3. The left operand year is a vector of numerical year values, and the right operand 2010 is a constant number we wish to compare with every element of year. What filter does is compute a boolean vector of the same length as the number of rows in the dataframe, then include every row where this vector is TRUE, and exclude the rows where it is FALSE.
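You can reproduce filter’s behavior yourself with square brackets; a base R sketch on a few made-up daily counts:

```r
daily.births <- c(9083, 8006, 11363, 13032)

keep <- daily.births >= 10000   # one boolean per element
keep
# [1] FALSE FALSE  TRUE  TRUE

daily.births[keep]              # keep only the TRUE positions
# [1] 11363 13032
```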
In R, TRUE values are analogous to the number 1, and FALSE values are analogous to the number 0. Test out my claim by creating a code chunk and testing if TRUE is equal to 1 and FALSE is equal to 0. Can you do both logical tests in one line?
What will happen if we apply the sum function to a vector of boolean values? Try it out below.
If I want to know the proportion of boolean values equal to TRUE instead of the count, we can use the arithmetic mean instead. Try it, and verify that it’s correct by using the sum and the length function together.
Find the proportion of all cats in the dataset named “Matilda”. The data’s a bit sloppy, so be careful to count lower case “matilda” as well.
ellerys.cats <- c("Matilda", "matilda", "matty", "Maverick", "Mallorie", "Matilda", "Matilda", "Missy", "Matilla", "Matt", "Maxine", "mallory", "matilda", "Max", "matilda", "Maxwell", "Mini", "Matilda", "Matthew", "Madison", "Maive", "matilda", "Mark", "mindy", "Matilda", "Mandy", "Melville", "Marisha", "matilda", "marissa", "Matilda", "Mable", "Maureen", "molly", "Matilda", "Matilda", "Moe", "Maha", "Melanie", "Melody", "matilda", "Mina", "Manny")

Mutation
- We’ve seen sorting, filtering, and summarizing. What’s the equivalent to formulas in Excel?
- Suppose I want to standardize the births column with z-scores, which I might do in preparation for regression.
- The mutate function allows us to create new columns that are functions of the existing columns, or any data of the same length.
- The syntax is: data.frame %>% mutate( new.col.name = function.of.existing.cols )
births <- births %>%
mutate( birth.count.z = (birth.count - mean(birth.count)) / sd(birth.count) )
head(births)

# A tibble: 6 × 7
date year month day week_day birth.count birth.count.z
<date> <int> <int> <int> <ord> <int> <dbl>
1 2000-01-01 2000 1 1 Sat 9083 -0.975
2 2000-01-02 2000 1 2 Sun 8006 -1.44
3 2000-01-03 2000 1 3 Mon 11363 0.00556
4 2000-01-04 2000 1 4 Tues 13032 0.723
5 2000-01-05 2000 1 5 Wed 12558 0.519
6 2000-01-06 2000 1 6 Thurs 12466 0.480
Suppose we’re interested in investigating whether more babies are born during warmer months.
Add a column to the births data set that takes the value TRUE during warmer months (April through September), and FALSE during colder months (October through March); call it is.warmer. Hint: Check out the between function and see if you can use it.
Create a similar column called is.holiday. Use major federal holidays in the US. You might want to check out the documentation for case_when. This function allows you to test whether the input column matches any one of a list of tests (such as the date of a holiday), and outputs a corresponding value. The syntax for each of these tests is test.resulting.in.boolean.val ~ output. A chatbot could be very useful here too.
births <- births %>%
mutate( is.warmer = between(month, 4, 9))
head(births)

# A tibble: 6 × 8
date year month day week_day birth.count birth.count.z is.warmer
<date> <int> <int> <int> <ord> <int> <dbl> <lgl>
1 2000-01-01 2000 1 1 Sat 9083 -0.975 FALSE
2 2000-01-02 2000 1 2 Sun 8006 -1.44 FALSE
3 2000-01-03 2000 1 3 Mon 11363 0.00556 FALSE
4 2000-01-04 2000 1 4 Tues 13032 0.723 FALSE
5 2000-01-05 2000 1 5 Wed 12558 0.519 FALSE
6 2000-01-06 2000 1 6 Thurs 12466 0.480 FALSE
Grammar of graphics
Let’s have a taste of visualization with ggplot2. This package is loaded automatically when you call library(tidyverse), though you have the option of using library(ggplot2) on its own if you don’t want the rest of the tidyverse.

Suppose I’m interested in the number of births over time.
# One line at a time
births.time <- ggplot(
data = births,
mapping = aes(
x = date,
y = birth.count,
color= week_day
)
) +
geom_point(size = 2, alpha = 0.1) +
geom_smooth(method = "loess") +
theme_bw() +
labs(
x = "Date",
y = "Births Count",
title = "Births Over Time",
color = "Day of Week"
)
births.time
ggsave("export/births.time.jpg", create.dir = TRUE)- The tidyverse data wrangling operations are changed together with pipes.
- In contrast, a ggplot is made up of stacked layers. So instead of piping, we add together the layers.
- The order of the layer matters; layes added later appear on top of existing layers.
- The ggplot function contains the “piece of paper” on which everything else is built. You can call it with no arguments. But we usually provide at least the data argument.
- Each argument to ggplot is like plot specifications that are inherited by every layer to come. More on that below.
- The mapping argument describes how the columns in the dataframe correspond to aesthetic elements of the plot. Here, we specify that the x axis is the date column and the y axis is the birth.count column. The color aesthetic refers to the color of the points, lines, or outlines of plot elements.
- geom_ layers are the plot types you have to choose from. There are lots! geom_point is a scatter plot, geom_smooth is a summary curve like a linear regression or a moving average. There’s also geom_histogram, geom_density, geom_violin, geom_bar, and more!
- Each geom requires different aesthetics, and allows different aesthetics. The documentation for each one has a section describing the aesthetics it needs.
- Above, we specified aesthetic mappings and data in the ggplot object, so each geom_ layer inherits this information. If we didn’t specify them in ggplot, we would have to provide the data and mapping arguments explicitly in each geom_. This could be useful if you need different aesthetics or different data for each geom_.
- When you supply aesthetics to a geom outside the mapping argument, you are telling ggplot2 to use this visual aesthetic consistently across all data elements on the plot (as opposed to varying this aesthetic according to a variable).
- theme_ layers are each a default collection of theming parameters. There are tons of possible customizations that you can look up in the documentation for the theme function. The prebuilt theme_ layers are nice defaults that work in most circumstances. There’s also theme_minimal, theme_void, … Look up a list and try them out.
- labs defines the labels used for different plot elements.
- ggsave takes the last active ggplot object and saves it to a file with the name given by its first argument, a string. The extension provided in this string determines the file type.
- R also allows you to plot data on geographical maps or images, but we won’t cover that here.
Are more babies born in warmer months? In one figure, make a pair of violin plots to answer this question. Use geom_violin(). Be sure to create labels and titles. You’ll want to specify the x and y aesthetics.
See if you can figure out how to change the x axis so that it says “Warmer” and “Cooler”; use a scale_x_discrete layer. You might also try to assign fills to the violins so that their colors correspond with “Warmer” and “Cooler”; analogously, use the appropriate aesthetic and a scale_fill_manual layer.
Some hints:
- limits defines the possible values appearing in the data that the scale will show. In a numerical column, these are numbers; in a boolean column, TRUE and FALSE; in a factor column, the names of the levels. ggplot will automatically select limits that cover the entire range of the data, but the limits do not have to include all the values that actually appear. Only the data included in limits will appear on the plot.
- breaks is used together with labels. breaks denotes values in the data that you wish to indicate on the plot, i.e. limits controls what data appears, and breaks controls which data are labeled. By default, it is the same as limits.
- By default, each element in breaks will be labeled with the data value itself: e.g. if the column contains TRUE then the label will be TRUE. If instead you want all values equal to TRUE to be labeled with a different string, you specify it in the labels argument, one label for each of the breaks, in the same order.
- values is used for scales, like color/fill scales, where the data values correspond to some other aesthetic like blue/red or a line thickness. To each of the limits, there should correspond an element of values in the same order.
We’ve seen that weekend days have consistently fewer births by a significant margin. It might make sense to create separate plots, one comparison plot for weekdays and the other for weekends. Following a similar strategy as we did before, create a column called is.weekend. Recreate the same violin plot, but this time add a facet_grid layer. Read the documentation to see how to use it.
- Data wrangling extensions: Joining datasets and pivoting datasets
- Data pipeline extensions: Saving to csv, saving and writing R objects as .rds or .Rdata, saving or writing json files, retrieving data from an api.
- Programming tools extensions: for/while loops, if statements, mapping, writing custom functions.
Rendering quarto files
Simple
`.R` files are useful if you have large chunks of function definitions or other code you want to run all in one go to get a single output. You can also call a `.R` file from within another file, such as this notebook, by using the `source` function. This behavior is helpful when you want to reuse the same script in many notebooks.

We've been working in a `.qmd` document, a Quarto notebook. Code notebooks allow you to combine code, figures, and text into a document that you can export as a webpage, a pdf document, or anything really. See the Quarto Documentation.

This notebook, as we've been doing, also allows you to execute small chunks of code at a time. This behavior saves you from having to copy and paste code into the console, or highlight and right-click on code. You've grouped it already into helpful code chunks.
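To see how `source` behaves, here is a minimal self-contained sketch: it writes a tiny throwaway script to a temporary file and then sources it. The function name `add.one` and the script contents are made up for illustration.

```r
# Write a tiny script to a temporary file, then source it.
script <- tempfile(fileext = ".R")
writeLines("add.one <- function(x) x + 1", script)
source(script)   # runs the script, defining add.one in the current environment
add.one(41)      # 42
```

In practice you would source a real helper file, e.g. one sitting next to your notebook.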
There are many customization options available at the document level. For example, setting a table of contents, which output format to make, a title, an author, any global option you want applied to every chunk.
There are also customization options at the chunk level. For example, choose whether code is executed, whether it is included in the output, whether errors are included, and so on. We can also refer to figures, such as Figure 1, by their label. We can change figure dimensions too.
Rendering allows you to compile your document for easy sharing and viewing with anyone. It’s useful for documenting reports about your work with others.
To render, click the Render button in the top middle part of the `.qmd` file's pane.

A common frustrating issue: the code works when you run it cell by cell, but not when Quarto renders. What gives? Usually, this is because you have some variable or data in your current environment that was created at one point in your file, but then you erased the code. When Quarto runs, it starts fresh, so that deleted code is never run. To find the error, clear your R environment and then run the code cell by cell from the top.
To save a workspace, you can use `save.image` to produce an `.RData` file. In the future, loading the file will put all your R objects back into the workspace as they were when they were saved. This is the fastest and most efficient option, but these files cannot be opened by any program other than R.

save.image(file = "export/rfundamentals.Rdata")

Later, we can use `load("export/rfundamentals.Rdata")` to retrieve everything.

After you finish off some exercises, try rendering this notes file again to see your work appear.
Modeling Numerical Data
This section introduces linear models, including t-tests and ANOVA.
Student’s t-test
- Now suppose we wish to conduct a statistical test to determine whether the difference in births between warmer and cooler months is statistically significant.
library(rstatix)

rstatix supports the same piping with dataframes that tidyverse does.
We would like to perform a t-test using the `t_test` function. Let's ask a chatbot how to use it.

Formulas in R are written like `left.side ~ right.side`. They are usually used as inputs to functions when we wish to specify a relationship between `left.side` and `right.side`. For example, in regression the `left.side` is often `y`, the response, and the `right.side` is `x1 + x2 + ...`, the predictors. They can be used in other contexts too, and the documentation for a function will specify what is supposed to be on each side of the formula.
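A formula is an ordinary R object: nothing is computed until a modeling function interprets it, so you can build and inspect one on its own. A minimal sketch using the column names from our births data:

```r
# A formula is just an object; the tilde separates response from predictors.
f <- birth.count ~ is.warmer
class(f)     # "formula"
all.vars(f)  # the variable names it mentions: "birth.count" "is.warmer"
```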
births %>% t_test( birth.count ~ is.warmer, detailed = TRUE)

# A tibble: 1 × 15
estimate estimate1 estimate2 .y. group1 group2 n1 n2 statistic
* <dbl> <dbl> <dbl> <chr> <chr> <chr> <int> <int> <dbl>
1 -404. 11148. 11552. birth.count FALSE TRUE 2734 2745 -6.46
# ℹ 6 more variables: p <dbl>, df <dbl>, conf.low <dbl>, conf.high <dbl>,
# method <chr>, alternative <chr>
- How should we interpret this result? Is statistical significance relevant here? It depends on the context.
- If we are asking: "Should we increase staffing at the hospital during warmer months?" Then no, the significant result doesn't matter much, because the average difference of about 404 births per day is spread across the entire country.
- If we are asking: “Do humans instinctually attempt to give birth in the summer? Perhaps there is higher fertility in the fall and winter?” The test implies the effect is small, but meaningful. Something is going on worth investigating.
The Welch’s t-test, which the t_test function performs by default, assumes that each sample is taken from a normal distribution with unknown variance–not necessarily equal to one another. The test tries to determine whether the two source distributions have the same mean.
First, based on our investigation so far, do you expect normal distributions? Why or why not?
Let’s test out the normality assumption formally. First, produce two histograms, one for each sample, using your knowledge of ggplot2. Use your resources to determine which geom_ to use and which aesthetics to provide. You may wish to use a facet_grid or an additional aesthetic to distinguish between the two samples.
To accompany your histograms, you should now also create a QQ plot. The y-axis is the quantile of the data distribution, and the x-axis is the quantile of the reference distribution–i.e. normal in this case. The plot contains one point per observation in the data. If the data distribution is normal, the plot will form a line.
Use geom_qq and geom_qq_line to build a QQ plot with ggplot2.
Linear Regression and ANOVA
- Linear regression allows us to establish whether some response variable is associated with some predictor variables.
- Let's revisit the `iris` dataset we inspected earlier.
- The `lm` function is built in to R. `lm` stands for linear model. Let's read the documentation.
help(lm)

- The "formula" object strikes again. In the 'Details' section, we see that the left side expects a response variable, and the right side expects predictor variables. The variables are separated with plus signs. One can create interaction terms with the `*` or the `:`; we will use `*` below.
- Let's model `Petal.Length` as a function of `Petal.Width`. It seems reasonable to guess that these quantities might be positively correlated. But the relationship between the two probably differs between species. Thus, we will want to look at an interaction between `Petal.Width` and `Species`.
petal.len.wid <- lm(data = iris, formula = Petal.Length ~ Petal.Width * Species )
class(petal.len.wid)

[1] "lm"
summary(petal.len.wid)
Call:
lm(formula = Petal.Length ~ Petal.Width * Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-0.84099 -0.19343 -0.03686 0.16314 1.17065
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3276 0.1309 10.139 < 2e-16 ***
Petal.Width 0.5465 0.4900 1.115 0.2666
Speciesversicolor 0.4537 0.3737 1.214 0.2267
Speciesvirginica 2.9131 0.4060 7.175 3.53e-11 ***
Petal.Width:Speciesversicolor 1.3228 0.5552 2.382 0.0185 *
Petal.Width:Speciesvirginica 0.1008 0.5248 0.192 0.8480
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3615 on 144 degrees of freedom
Multiple R-squared: 0.9595, Adjusted R-squared: 0.9581
F-statistic: 681.9 on 5 and 144 DF, p-value: < 2.2e-16
anova(petal.len.wid)

Analysis of Variance Table
Response: Petal.Length
Df Sum Sq Mean Sq F value Pr(>F)
Petal.Width 1 430.48 430.48 3294.5561 < 2.2e-16 ***
Species 2 13.01 6.51 49.7891 < 2.2e-16 ***
Petal.Width:Species 2 2.02 1.01 7.7213 0.0006525 ***
Residuals 144 18.82 0.13
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The call to `lm` fits the model.
- The `summary` function, which we previously used on a dataframe, gives us model fit information when applied to an `lm` object. It shows coefficient values and statistical tests for whether each coefficient is significantly different from zero. Many R objects have a `summary` method that is unique to their needs.
- The `anova` function computes the corresponding ANOVA test, which tests whether each model term, relative to the model as a whole, explains a lot of the variance in the response variable.
Fit a linear model for any one of the species with Sepal.Length as the dependent variable and Petal.Length as the independent variable. Store the model in a variable.
This question is about interpreting the linear models. Pick any one of the three species models above.
- What does petal length mean?
- What does sepal length mean?
- When sepal length increases by one centimeter, how much does the petal length change?
- Assess how strong the linear relationship is.
- Over what range of sepal lengths is this model useful?
- Interpret the Intercept term in this model.
- Find the ratio of the fitted variability in the residuals to the average magnitude of the predictions.
Determine whether the relationship between Sepal.Length and Petal.Length differs between species. If so, which ones?
predict(petal.len.wid) %>% head()

       1        2        3        4        5        6 
1.436861 1.436861 1.436861 1.436861 1.436861 1.546160

- We can retrieve predicted values from a fitted `lm` object with the `predict` function. Like `summary`, different objects have `predict` outputs that are specific to their type.
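`predict` can also score observations the model never saw, via its `newdata` argument. A sketch with a simpler, hypothetical single-predictor model so the inputs are easy to see:

```r
# A simpler hypothetical model, to show predict() on new data.
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
# Three made-up flowers, one row per new observation
new.flowers <- data.frame(Petal.Width = c(0.5, 1.5, 2.5))
preds <- predict(fit, newdata = new.flowers)
preds   # one predicted petal length per row, increasing with width
```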
Visualizing model predictions
- What if I want to visualize my model? For complex models with many variables, the process is difficult because it's hard to plot more than two variables at a time.
- The `geom_line` layer accepts a sequence of points and draws a line that connects them in order.
ggplot(
data = iris,
mapping = aes(x = Petal.Width, y = Petal.Length, color = Species)
) +
geom_point() +
  geom_line(mapping = aes(y = predict(petal.len.wid)))

This time, we had to modify the `y` aesthetic for just the `geom_line` layer, so we specified it explicitly.

The approach above will work for any kind of model. But for certain simple models, `ggplot` has a built-in function for visualization. It's useful for a quick glimpse.
ggplot(
data = iris,
mapping = aes(x = Petal.Width, y = Petal.Length, color = Species)
) +
geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

Fit a linear model for any one of the species with Sepal.Length as the dependent variable and Petal.Length as the independent variable. Store the model in a variable.
Use your resources to figure out how to make a plot similar to that above, but with the equations of the models and their \(R^2\) values right on the plot.
Linear mixed effects models
Many modern experiments use mixed designs for their data, such as repeated measures.

Let's look at a sample dataset to illustrate mixed linear modeling.
library("lme4")
library("lmerTest")
head(sleepstudy)

  Reaction Days Subject
1 249.5600 0 308
2 258.7047 1 308
3 250.8006 2 308
4 321.4398 3 308
5 356.8519 4 308
6 414.6901 5 308
help(sleepstudy)

We wish to model how reaction time is related to days of sleep deprivation.

This model is a candidate for mixed effects modeling because each subject was measured many times. We might expect, therefore, that one or more model parameters vary between subjects.
A fixed effect is like a global average. It tells us how an average subject’s reaction time changes with sleep deprivation days. Fixed effects are usually the quantities we wish to measure and draw conclusions from.
A random effect has two parts: the fixed effect it modifies, and the grouping variable that determines which data points all receive the same modification. Typically, a random effect grouping factor is a nuisance in the data that is not really of scientific interest. It is merely there to facilitate a better fit for the fixed effect.
- Which variable is the dependent variable?
- Which variable(s) form the fixed effect(s)? Don’t forget about the intercept!
- Which fixed effects might be modified by a random effect?
- What is the grouping variable for that random effect?
Using your answers to exercise 1 as guidance, create a plot that shows the dependent variable on the y axis, the independent variable on the x axis, and the grouping variable as the color aesthetic so you can identify which datapoints go together.
Days is the fixed effect, and Subject is the grouping variable for a random effect on Days and the intercept.
# Form the base layer with sleep study data and aesthetic assignments
ggplot(sleepstudy, aes(x = Days, y = Reaction, color = Subject)) +
# Add points for each observation
geom_point() +
# Connect each observation of the same subject
geom_line(aes(group = Subject)) +
# Add labels
labs(
title = "Reaction Times by Days of Sleep Deprivation",
x = "Days of Sleep Deprivation",
y = "Reaction Time (ms)"
)- Let’s see how to fit the model.
help(lmer)Let’s see an example
mx.effects <- lmer(
formula = Reaction ~ Days + (1 | Subject),
data = sleepstudy
)
summary(mx.effects)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Reaction ~ Days + (1 | Subject)
Data: sleepstudy
REML criterion at convergence: 1786.5
Scaled residuals:
Min 1Q Median 3Q Max
-3.2257 -0.5529 0.0109 0.5188 4.2506
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 1378.2 37.12
Residual 960.5 30.99
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 251.4051 9.7467 22.8102 25.79 <2e-16 ***
Days 10.4673 0.8042 161.0000 13.02 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
Days -0.371
The REML criterion at convergence is telling you how likely your data is under the fitted model. The number is not meaningful in and of itself, but it can be used to compare models with the same fixed effects. Lower (more negative) values are better.
Scaled residuals are each residual divided by the fitted residual standard error.
The random effects section shows how much the subjects (the grouping variable) vary around the fixed effects, as well as a fitted correlation between random effects when there is more than one. i.e. when a subject has a higher intercept, do they also have a larger effect of sleep deprivation as days go by?
The random effects section also shows how large the residuals are in the fitted model: here the residual standard deviation is about 31 ms of reaction time.

The fixed effects are the parameters of interest that apply to the general population. This table has a similar interpretation to a regular linear model.
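One common way to read the variance components in the summary output above is the intraclass correlation (ICC): the share of unexplained variation that sits between subjects rather than within them. A quick back-of-the-envelope calculation using the numbers printed in the summary:

```r
# Variance components copied from the random effects table above
var.subject  <- 1378.2   # between-subject (intercept) variance
var.residual <- 960.5    # residual (within-subject) variance
icc <- var.subject / (var.subject + var.residual)
round(icc, 2)   # 0.59: about 59% of unexplained variation is between subjects
```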
The correlation of fixed effects section treats the estimated coefficients as random variables, where any particular estimate is the result of sampling randomness. It suggests that if we repeated the sampling many times, a larger estimated effect of deprivation would tend to appear together with a slightly lower estimated intercept.
Finally, note that the `anova` function works as before.
anova(mx.effects)

Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Days 162703 162703 1 161 169.4 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Modify the model we created so that it also includes a random effect for the Days term. Is it a better model? How can you tell?
Create a regular old linear model with the same fixed effects. Do the fitted parameters differ? By how much?
Textual Analysis
This section covers using the tm library to prepare a Corpus, which we apply to wordclouds. Then it introduces several common string manipulations and the purrr package.
Reading in a csv review
- Let’s do a quick refresher on loading data from a csv file.
- The working directory is the file system location (e.g. "Documents") in which R looks for files when you try to open them.
getwd()
# setwd("your directory here")

- We'll use `read_csv` to load the data. (There is also a `read.csv` in R's built-in `utils` package, but `read_csv` from `tidyverse` has a few nice extra options.)
library(tidyverse)
wos.raw <- read_csv("data/ClimateAndArt.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 12686 Columns: 70
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (59): Publication Type, Authors, Book Authors, Book Editors, Book Group ...
dbl (8): Cited Reference Count, Times Cited, WoS Core, Times Cited, All Dat...
lgl (3): Book Series Subtitle, Cited References, Meeting Abstract
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(wos.raw)

Rows: 12,686
Columns: 70
$ `Publication Type` <chr> "J", "J", "J", "J", "J", "J", "J", "J", "…
$ Authors <chr> "Miles, M", "Dal Farra, R; Suarez, P", "C…
$ `Book Authors` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Book Editors` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Book Group Authors` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Author Full Names` <chr> "Miles, Malcolm", "Dal Farra, Ricardo; Su…
$ `Book Author Full Names` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Group Authors` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Article Title` <chr> "Representing nature: art and climate cha…
$ `Source Title` <chr> "CULTURAL GEOGRAPHIES", "LEONARDO", "REVI…
$ `Book Series Title` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Book Series Subtitle` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Language <chr> "English", "English", "Spanish", "English…
$ `Document Type` <chr> "Article", "Article", "Article", "Article…
$ `Conference Title` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Conference Date` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Conference Location` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Conference Sponsor` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Conference Host` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Author Keywords` <chr> "art; climate-change; ecology; representa…
$ `Keywords Plus` <chr> NA, NA, "SELF-DETERMINATION THEORY; PRO-E…
$ Abstract <chr> "Climate change is now an established sci…
$ Addresses <chr> "Univ Plymouth, Sch Art & Media, Plymouth…
$ Affiliations <chr> "University of Plymouth", "Concordia Univ…
$ `Reprint Addresses` <chr> "Miles, M (corresponding author), Univ Pl…
$ `Email Addresses` <chr> "M.F.Miles@plymouth.ac.uk", "ricardo.dalf…
$ `Researcher Ids` <chr> NA, NA, "Chen, Mei-Hsin/K-2867-2017", NA,…
$ ORCIDs <chr> NA, NA, "Chen, Mei-Hsin/0000-0002-2378-28…
$ `Funding Orgs` <chr> NA, NA, NA, NA, "FORMAS (SE); BELSPO (BE)…
$ `Funding Name Preferred` <chr> NA, NA, NA, NA, "FORMAS (SE); BELSPO (BE)…
$ `Funding Text` <chr> NA, NA, NA, NA, "Thanks to All the Kerour…
$ `Cited References` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Cited Reference Count` <dbl> 50, 2, 101, 88, 57, 90, 63, 112, 31, 8, 7…
$ `Times Cited, WoS Core` <dbl> 42, 1, 0, 7, 4, 28, 9, 14, 0, 7, 67, 3, 0…
$ `Times Cited, All Databases` <dbl> 42, 1, 0, 8, 4, 28, 9, 14, 0, 7, 67, 3, 0…
$ `180 Day Usage Count` <dbl> 3, 0, 5, 6, 3, 4, 0, 4, 0, 3, 1, 1, 0, 1,…
$ `Since 2013 Usage Count` <dbl> 25, 3, 5, 71, 17, 47, 24, 28, 2, 12, 31, …
$ Publisher <chr> "SAGE PUBLICATIONS LTD", "MIT PRESS", "UN…
$ `Publisher City` <chr> "LONDON", "CAMBRIDGE", "SAN JOSE", "ABING…
$ `Publisher Address` <chr> "1 OLIVERS YARD, 55 CITY ROAD, LONDON EC1…
$ ISSN <chr> "1474-4740", "0024-094X", "2215-2253", "0…
$ eISSN <chr> "1477-0881", "1530-9282", "2215-3934", "1…
$ ISBN <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Journal Abbreviation` <chr> "CULT GEOGR", "LEONARDO", "R HUMANIDAD", …
$ `Journal ISO Abbreviation` <chr> "Cult. Geogr.", "Leonardo", "R. Humanidad…
$ `Publication Date` <chr> "JAN", "OCT", "JUL-DEC", "2-Jan", NA, "MA…
$ `Publication Year` <dbl> 2010, 2014, 2022, 2015, 2020, 2018, 2017,…
$ Issue <chr> "1", "5", "2", "1", NA, NA, "1", NA, "6",…
$ `Part Number` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Special Issue` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Meeting Abstract` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Start Page` <chr> "19", "493", NA, "39", NA, "95", "93", NA…
$ `End Page` <chr> "35", "493", NA, "54", NA, "105", "116", …
$ `Article Number` <chr> NA, NA, "e51060", NA, "100253", NA, NA, "…
$ DOI <chr> "10.1177/1474474009349997", "10.1162/LEON…
$ `DOI Link` <chr> "http://dx.doi.org/10.1177/14744740093499…
$ `Book DOI` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Early Access Date` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Number of Pages` <dbl> 17, 1, 18, 16, 15, 11, 24, 19, 16, 4, 24,…
$ `WoS Categories` <chr> "Environmental Studies; Geography", "Art"…
$ `Web of Science Index` <chr> "Social Science Citation Index (SSCI); Ar…
$ `Research Areas` <chr> "Environmental Sciences & Ecology; Geogra…
$ `IDS Number` <chr> "549EU", "AP6YE", "2T9FT", "AS7TP", "PK4R…
$ `Pubmed Id` <dbl> NA, NA, NA, NA, 33106769, NA, NA, NA, NA,…
$ `Open Access Designations` <chr> NA, NA, "gold, Green Submitted", NA, "Gre…
$ `Highly Cited Status` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Hot Paper Status` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Date of Export` <chr> "12/21/22", "12/21/22", "12/21/22", "12/2…
$ `UT (Unique WOS ID)` <chr> "WOS:000274029500002", "WOS:0003422237000…
$ `Web of Science Record` <chr> "View Full Record in Web of Science", "Vi…
wos <- wos.raw %>%
# Select and rename the three important columns
# Foreshadowing for the reqs of a Corpus
select(doc_id = `UT (Unique WOS ID)` , text=Abstract, title = `Article Title`)
# Note the use of backticks `column name` when referring to columns with "illegal" names containing special characters or spaces.
glimpse(wos)

Rows: 12,686
Columns: 3
$ doc_id <chr> "WOS:000274029500002", "WOS:000342223700017", "WOS:000822772700…
$ text <chr> "Climate change is now an established scientific fact, and deal…
$ title <chr> "Representing nature: art and climate change", "RED CROSS/RED C…
The Web of Science is a research tool that aggregates research articles across many journals and helps researchers find and discover research they might be interested in.
There are many of these files in the directory that we need to combine. You could read them all in one by one, but if there were too many to type out we’d need a better solution. We’ll come back to this issue.
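As a preview, one base R approach is to list the files and loop over them. A sketch, using two toy files written to a temporary directory (the file names and columns are made up for illustration):

```r
# Set up a throwaway directory with two small csv files
demo.dir <- file.path(tempdir(), "csvdemo")
dir.create(demo.dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:2), file.path(demo.dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4), file.path(demo.dir, "b.csv"), row.names = FALSE)

# List every csv, read each one, and stack the results
files <- list.files(demo.dir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.csv))
nrow(combined)  # 4
```

The purrr package offers a tidier version of this pattern, which we'll meet later.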
Prepping a corpus
We’d like to know what kinds of words are in these articles. For that, we use a wordcloud. The larger the word in the wordcloud, the more frequent it is in the dataset.
Let’s look up
wordcloud2
library(wordcloud2)
# Unfortunately, wordcloud2 didn't implement docs, go online instead.
help(wordcloud2)

We need to create a sorted dataframe with two columns: word, and frequency of word.
The `tm` library is the workhorse of text mining in R. It has many standardized functions that other packages depend on. Here are the basic steps to start working with any `tm`-based package.

See the documentation online.

- Every `tm` function accepts a Corpus as a starting point. `wos` has to contain a unique `doc_id` column, a `text` column, and then any other metadata we care to mention.
library(tm)
# Create a corpus
corpus <- Corpus( DataframeSource(wos) )

- To make any Corpus, we start with an object containing the inputs, in this case a dataframe.
- We call the appropriate `Source` function to inspect the data. The role of the `Source` function is to ensure the data is in an interpretable format.
- Next we call a `Corpus` function. There are many! Both created in `tm` as well as other packages. The goal of the `Corpus` function is to prep our data for a specific use.
The TermDocumentMatrix
- We turn now to a `TermDocumentMatrix`. Each row corresponds to a word in the corpus, and each column corresponds to a document in the corpus. The entries are counts of how many times that word appeared in that document. This is a common format for text mining applications.
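To see what a term-document matrix is without any packages, here is a toy version built by hand in base R, on two made-up one-line "documents":

```r
# Two toy documents
docs <- c(d1 = "climate art climate change", d2 = "art and science")

# Split each document into words, collect the vocabulary, and count
words <- strsplit(docs, " ")
terms <- sort(unique(unlist(words)))
tdm.small <- sapply(words, function(w) table(factor(w, levels = terms)))

tdm.small               # rows are terms, columns are documents
tdm.small["climate", "d1"]  # 2
```

`TermDocumentMatrix` does the same bookkeeping at scale, storing the result sparsely.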
# Create a term-document matrix
tdm <- TermDocumentMatrix(
corpus,
control = list(
# bounds = list(global = c(min.docs.w.word, max.docs.w.word)),
tolower = TRUE,
removePunctuation = TRUE
# What else would you put here? We'll come back to this.
)
)
tdm

<<TermDocumentMatrix (terms: 133709, documents: 12686)>>
Non-/sparse entries: 1597087/1694635287
Sparsity : 100%
Maximal term length: 272
Weighting : term frequency (tf)
Converting to the word.freq data.frame
- To obtain the total number of times each word occurs, we need to add up the number of times each word occurs in each document. These are all the entries in each row, so we'll use `row_sums` from the `slam` package.
# slam contains high performance ops for sparse matrices
# the tdm object is a sparse matrix
library("slam")
word.freq <- data.frame(row_sums(tdm))
head(word.freq)

          row_sums.tdm.
achieved, 20
addresses. 1
always 174
among 1347
and 108753
another 236
Nice! We’re nearly there. All that’s left is to get the form of the dataframe right.
Currently, our words are captured as rownames instead of a column. There's a quick function to fix that from the `tibble` package, aptly called `rownames_to_column`.
Building the wordcloud
- Finally, we’ll need to use the required column names for the
wordcloud2function:wordandfreq.
- It’s your job to finish things off.
Note the order of the exercises is wonky. The code will work once you complete Exercise 1.
Use the template code below to make a word cloud. During this exercise, you will have to look up the documentation and experiment with the various arguments of each function. Check out their effects on the wordcloud after each change by running this code chunk.
# Create a term-document matrix
tdm <- TermDocumentMatrix(
corpus,
control = list(
# Try adjusting this important parameter.
# bounds = list(local = c( min.occurences.in.doc, max.occurences.in.doc)),
tolower = TRUE,
removePunctuation = TRUE,
stopwords = TRUE #,
# What else would you put here? We'll come back to this.
)
)
# Exercise 1: Yes the number is correct. Start here.
# When you're finished, run this code cell. Your wordcloud should appear
# Create the frequency table
word.freq <- data.frame(row_sums(tdm)) %>%
rownames_to_column( arg.name = your.arg ) %>%
rename( freq = freq.col.from.tdm )Error in rownames_to_column(., arg.name = your.arg): unused argument (arg.name = your.arg)
# Exercise 3: Yes the number is correct. End here
# Create the wordcloud itself
wordcloud <- wordcloud2(
word.freq #,
# other.args to modify the appearance of the wordcloud?
)Error in `[.data.frame`(data, , 1:2): undefined columns selected
# Display the wordcloud
wordcloud

Error in eval(expr, envir, enclos): object 'wordcloud' not found
This code serves as a solution to question 1.
# Create a term-document matrix
tdm <- TermDocumentMatrix(
corpus,
control = list(
# Try adjusting this important parameter.
# bounds = list(local = c( min.occurences.in.doc, max.occurences.in.doc)),
tolower = TRUE,
removePunctuation = TRUE,
stopwords = TRUE #,
# What else would you put here? We'll come back to this.
)
)
# Exercise 1: Yes the number is correct. Start here.
# When you're finished, run this code cell. Your wordcloud should appear
# Create the frequency table
word.freq <- data.frame(row_sums(tdm)) %>%
rownames_to_column( var = "word") %>%
rename( freq = row_sums.tdm. )
# Exercise 3: Yes the number is correct. End here
# Create the wordcloud itself
wordcloud <- wordcloud2(
word.freq
)
# Display the wordcloud
wordcloud

Optionally, if you want to save your wordcloud for later, you can use the following code.
# Optional: Saving a wordcloud
# Use htmlwidgets to save as an interactive file for your web browser.
library("htmlwidgets")
saveWidget(wordcloud, "export/wordcloud.html", selfcontained = TRUE)

The first argument is the htmlwidget to save, the second argument is the name to give the file, and setting `selfcontained = TRUE` embeds all the data required to generate the widget so the file will always display properly on its own.
String split
The `stringr` package, part of the `tidyverse`, contains a suite of functions dedicated to searching, modifying, splitting, or otherwise manipulating strings. Any researcher working with text programmatically will have to learn some of them.

Suppose, for instance, we want to create a dataframe of all the ORCIDs appearing in this dataset.

Let's take a look at the ORCIDs column in the dataset.
wos.raw %>% select(ORCIDs) %>% head()

  ORCIDs
1 <NA>
2 <NA>
3 Chen, Mei-Hsin/0000-0002-2378-2821
4 Guy, Simon Charles/0000-0003-1507-5067; Heidrich, Oliver/0000-0002-6581-5572
5 Jorgensen, Bethany/0000-0002-8216-1398
6 Whitmarsh, Lorraine/0000-0002-9054-1040; Whitmarsh, Lorraine/0000-0002-9054-1040
The ORCIDs for each author are listed in the format: “Last, First Middle/ORCID”. For publications with multiple authors, the authors are separated by a semicolon and a space.
To solve this programming problem, we're going to need to split each ORCIDs entry by the string "; " into separate rows. Then, we're going to have to split by the string "/" into separate columns.
wos.orcid <- wos.raw %>%
# Get just the name/orcid combos
select(name.orcid = ORCIDs) %>%
# Take each row and split it into multiple rows, one for each ;
separate_rows(name.orcid, sep = "; ") %>%
# Remove all the empty entries
filter(name.orcid != "") %>%
# Split again, this time into columns
separate_wider_delim(
# The name of the column to split
cols = name.orcid,
# The separator by which to split
delim = "/",
# The names of the new columns to create
names = c("full.name", "orcid"),
# A few invalid ORCIDs, drop them for now
too_many = "drop"
)
head(wos.orcid)

# A tibble: 6 × 2
full.name orcid
<chr> <chr>
1 Chen, Mei-Hsin 0000-0002-2378-2821
2 Guy, Simon Charles 0000-0003-1507-5067
3 Heidrich, Oliver 0000-0002-6581-5572
4 Jorgensen, Bethany 0000-0002-8216-1398
5 Whitmarsh, Lorraine 0000-0002-9054-1040
6 Whitmarsh, Lorraine 0000-0002-9054-1040
In general, functions that split strings have an argument called either “sep” or “delim”.
There is also a function called `str_split` that performs a similar job, but it returns the split components as vectors (one per input string) instead of creating new rows or columns.
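The base R equivalent, `strsplit`, shows the same idea on a single string; the entry below is the first ORCID from the table above:

```r
# Split one "name/orcid" entry on the slash
entry <- "Chen, Mei-Hsin/0000-0002-2378-2821"
parts <- strsplit(entry, "/", fixed = TRUE)[[1]]
parts   # "Chen, Mei-Hsin"  "0000-0002-2378-2821"
```

`strsplit` returns a list with one character vector per input string, which is why we extract the first element with `[[1]]`.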
Extend the code above such that there are two additional columns called “given.name” and “surname”.
wos.orcid %>%
# Remove all the empty entries from the last separate
filter ( comparison )Error in `filter()`:
ℹ In argument: `comparison`.
Caused by error:
! object 'comparison' not found
# Split again into new *columns*: "surname" and "given.name"

Modify your code above such that it retains the full.name column.
Read about the unite function and experiment with using it to reverse the split operations.
Searching strings
- The other most common string task is to search for strings that contain a given pattern.
mountains <- c(
"Bear Peak",
"Green Mountain",
"South Boulder Peak",
"Flagstaff Mountain"
)
query <- "Peak"
mountains

[1] "Bear Peak"          "Green Mountain"     "South Boulder Peak"
[4] "Flagstaff Mountain"
str_detect(string = mountains, pattern = query)

[1]  TRUE FALSE  TRUE FALSE
`str_detect` returns `TRUE` or `FALSE` once for each element of the character vector argument "string".

There's another very similar function built in to R called `grepl`, just so you know.

These functions are useful for filtering dataframes or vectors.

For good measure, there is a function called `str_replace` that searches a string for a match, and if it finds one, it replaces the match with a different string you provide.
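The base R pair can be sketched on the same vector: `grepl` mirrors `str_detect`, and `sub` mirrors `str_replace` (replacing only the first match in each element):

```r
mountains <- c("Bear Peak", "Green Mountain", "South Boulder Peak", "Flagstaff Mountain")

grepl("Peak", mountains)            # TRUE FALSE TRUE FALSE
sub("Mountain", "Mtn", mountains)   # e.g. "Green Mountain" becomes "Green Mtn"
```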
Print the top few entries in your ORCID dataframe, pick any orcid, and find all the papers that person was an author on. Lots of people only have 1 paper, but you might be able to find others.
Searching this way is helpful because the ORCIDs are standard across journals, but the names themselves might be printed differently.
Reusing some of our code above with modifications, obtain a list of individuals who do not have both a surname and a given name listed.
Hint: Everyone with both name types has the same pattern in their full name.
Regular expressions
Sometimes, we need finer control over the strings we search on, split on, or replace with.
Regular expressions, or regex for short, are very powerful pattern-matching tools that let you specify complex queries like: match only at the beginning/end of the string, match strings containing “start”, then anything at all, then “stop”, but not strings that contain “start” and “stop” in other places, and so on.
Here’s an example of a powerful regular expression that detects whether any of the keywords in our dataset are accidentally repeated.
repeated.pattern <- "(?:^|; )([^;]+)(?:;.*; |; )\\1(?:$|;)"

wos.key <- wos.raw %>%
  select(keywords = Author.Keywords) %>%
  mutate(has.duplicate = str_detect(keywords, repeated.pattern))
wos.key %>% filter(has.duplicate, nchar(keywords) < 88)
You could take a whole class on regex, but I thought you should see the basic concepts.
Here’s the documentation. It works in R too, but you do have to be careful to use an extra backslash everywhere a backslash appears in the documentation.
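To see why the extra backslash matters, here is a tiny example: the regex `\d` (any digit) has to be written as `"\\d"` inside an R string, because the string parser consumes one backslash before the regex engine ever sees the pattern.

```r
library(stringr)

# "\\d" in R source becomes the regex \d, which matches any digit
str_detect("ORCID 0000", "\\d")
# [1] TRUE
```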
Suppose we want to extract all the publications where the lead author’s last name starts with “A”. We can’t just `str_detect` on “A” because there may be other authors whose names start with “A”. We could try to `str_split` on spaces and look for a capital “A” in the first element… but that’s starting to get messy.
wos.author <- wos.raw %>%
select(authors = Author.Full.Names)
head(wos.author %>% filter( str_detect(authors, "^\\s*A"))) authors
1 Aragon, Carolina; Buxton, Jane; Infield, Elisabeth Hamin
2 Anwar, Samy A.; Diallo, Ismaila
3 Asenjo Fernandez, Ignacio
4 Armour, Kyle C.; Bitz, Cecilia M.; Roe, Gerard H.
5 Antich-Homar, Helena; Hess, Katharina; Solaun, Kepa; Alleng, Gerard; Flores, Adrian
6 Anker, Suzanne
- “^” matches the beginning of a string only.
- “\s” is a character class that matches any whitespace character (such as a space, a tab, or a new line). We’re accounting here for possible unwanted whitespace.
- “*” is a quantifier that means: match zero or more of the previous character. This ensures we’re prepared for any amount of whitespace, including none.
- “A” is just a character that matches… A.
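Putting those pieces together on a few toy strings:

```r
library(stringr)

# "^\\s*A": start of string, any amount of whitespace, then a literal A
str_detect(c("Anker, Suzanne", "  Anwar, Samy", "van Aken, Jan"), "^\\s*A")
# [1]  TRUE  TRUE FALSE
```

The third string fails because its “A” is not at the start, even though it contains one.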
For good measure, there’s another nice way to handle the whitespace issue:
head( wos.author %>% filter( str_detect( str_trim(authors), "^A"))) authors
1 Aragon, Carolina; Buxton, Jane; Infield, Elisabeth Hamin
2 Anwar, Samy A.; Diallo, Ismaila
3 Asenjo Fernandez, Ignacio
4 Armour, Kyle C.; Bitz, Cecilia M.; Roe, Gerard H.
5 Antich-Homar, Helena; Hess, Katharina; Solaun, Kepa; Alleng, Gerard; Flores, Adrian
6 Anker, Suzanne
- The `str_trim` function removes whitespace from the beginning and end of strings. It’s good practice to always apply this kind of cleaning to raw data.
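A quick sketch of `str_trim` on a padded toy string:

```r
library(stringr)

# Leading and trailing whitespace is stripped; interior spaces are untouched
str_trim("  Anker, Suzanne  ")
# [1] "Anker, Suzanne"
```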
Find all the publications whose last author has the first name “Lisa”. Here are the tools you need. Combine them in the correct way.
- $ matches the end of a string
- . matches any character
- “*” matches any number of the preceding character, including zero
- Every first name is preceded by “,”.
- Some, but not all, authors have additional names.
Find all the publications with the keyword “climate” alone without additional text like “climate change” or “rainy climate”.
- Every keyword is preceded by either “;” or the beginning of the string.
- Every keyword ends with either “;” or the end of the string.
- The | character matches exactly one of either the pattern on its left or the pattern on its right.
- Parentheses form capturing groups. The capturing part is not important here, but you can use them, say, on either side of the | to indicate that you wish to match a whole sequence of characters on the left, or its alternative on the right.
- \b matches any word boundary such as a space, punctuation or the beginning/end of a string.